NSF PAR Search | NSF Public Access Repository

Demand-driven provisioning of Kubernetes-like resources in OSG

https://doi.org/10.1051/epjconf/202429507014

Sfiligoi, Igor; Würthwein, Frank; Dost, Jeff; Lin, Brian; Schultz, David (January 2024, EPJ Web of Conferences)
De Vita, R; Espinal, X; Laycock, P; Shadura, O (Eds.)
The OSG-operated Open Science Pool is an HTCondor-based virtual cluster that aggregates resources from compute clusters provided by several organizations. Most of the resources are not owned by OSG, so demand-based dynamic provisioning is important for maximizing usage without incurring excessive waste. OSG has long relied on GlideinWMS for most of its resource provisioning needs, but GlideinWMS is limited to resources that provide a Grid-compliant Compute Entrypoint. To work around this limitation, the OSG Software Team has developed a glidein container that resource providers could use to directly contribute to the OSPool. The problem with that approach is that it is not demand-driven, relegating it to backfill scenarios only. To address this limitation, a demand-driven direct provisioner of Kubernetes resources has been developed and successfully used on the NRP. The setup still relies on the OSG-maintained backfill container image but automates the provisioning matchmaking and successive requests. That provisioner has also been extended to support Lancium, a green computing cloud provider with a Kubernetes-like proprietary interface. The provisioner logic has been intentionally kept very simple, making this extension a low-cost project. Both NRP and Lancium resources have been provisioned exclusively using this mechanism for many months.
Full Text Available
IceCube experience using XRootD-based Origins with GPU workflows in PNRP

https://doi.org/10.1051/epjconf/202429511011

Schultz, David; Sfiligoi, Igor; Riedel, Benedikt; Andrijauskas, Fabio; Weitzel, Derek; Würthwein, Frank (January 2024, EPJ Web of Conferences)
De Vita, R; Espinal, X; Laycock, P; Shadura, O (Eds.)
The IceCube Neutrino Observatory is a cubic kilometer neutrino telescope located at the geographic South Pole. Understanding detector systematic effects is a continuous process. This requires the Monte Carlo simulation to be updated periodically to quantify potential changes and improvements in science results with more detailed modeling of the systematic effects. IceCube’s largest systematic effect comes from the optical properties of the ice the detector is embedded in. Over the last few years there have been considerable improvements in the understanding of the ice, which require a significant processing campaign to update the simulation. IceCube normally stores the results in a central storage system at the University of Wisconsin–Madison, but it ran out of disk space in 2022. The Prototype National Research Platform (PNRP) project thus offered to provide both GPU compute and storage capacity to IceCube in support of this activity. The storage access was provided via XRootD-based OSDF Origins, a first for IceCube computing. We report on the overall experience using PNRP resources, with both successes and pain points.
Full Text Available
400Gbps benchmark of XRootD HTTP-TPC

https://doi.org/10.1051/epjconf/202429501001

Arora, Aashay; Guiang, Jonathan; Davila, Diego; Würthwein, Frank; Balcas, Justas; Newman, Harvey (January 2024, EPJ Web of Conferences)
De Vita, R; Espinal, X; Laycock, P; Shadura, O (Eds.)
Due to the increased network traffic expected during the HL-LHC era, the T2 sites in the USA will be required to have 400 Gbps of available bandwidth to their storage solution. With this in mind, we are pursuing a scale test of XRootD software when used to perform Third Party Copy transfers using the HTTP protocol. Our main objective is to understand the limitations in the software stack that could keep it from reaching the target transfer rate; to that end we have set up a testbed of multiple XRootD servers at both UCSD and Caltech, which are connected through a dedicated link capable of 400 Gbps end-to-end. Building upon our experience deploying containerized XRootD servers, we use Kubernetes to easily deploy and test different configurations of our testbed. In this work, we will present our experience doing these tests and the lessons learned.
Full Text Available
CRIU - Checkpoint Restore in Userspace for computational simulations and scientific applications

https://doi.org/10.1051/epjconf/202429507046

Andrijauskas, Fabio; Sfiligoi, Igor; Davila, Diego; Arora, Aashay; Guiang, Jonathan; Bockelman, Brian; Thain, Greg; Würthwein, Frank (January 2024, EPJ Web of Conferences)
De Vita, R; Espinal, X; Laycock, P; Shadura, O (Eds.)
Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing programmatically, it would be preferable if the batch scheduling system could do that independently. This paper evaluates the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool for GNU/Linux environments, with an emphasis on the OSG’s OSPool HTCondor setup. CRIU allows checkpointing the process state into a disk image and can deal with both open files and established network connections seamlessly. Furthermore, it can checkpoint both traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. However, some limitations prevent it from being usable in all circumstances.
Full Text Available
Defining a canonical unit for accounting purposes

https://doi.org/10.1145/3569951.3597574

Andrijauskas, Fabio; Sfiligoi, Igor; Würthwein, Frank (July 2023, ACM)

Full Text Available
Analyzing Transatlantic Network Traffic over Scientific Data Caches

https://doi.org/10.1145/3589012.3594897

Deng, Ziyue; Sim, Alex; Wu, Kesheng; Guok, Chin; Hazen, Damian; Monga, Inder; Andrijauskas, Fabio; Würthwein, Frank; Weitzel, Derek (July 2023, Proceedings of the 2023 on Systems and Network Telemetry and Analytics)

Full Text Available
Auto-scaling HTCondor pools using Kubernetes compute resources

https://doi.org/10.1145/3491418.3535123

Sfiligoi, Igor; DeFanti, Thomas; Würthwein, Frank (July 2022, 2022 Practice & Experience in Advanced Research Computing (PEARC22))

Full Text Available
Automated Network Services for Exascale Data Movement

https://doi.org/10.1051/epjconf/202429501009

Balcas, Justas; Newman, Harvey; Bhat, Preeti P; Würthwein, Frank; Guiang, Jonathan; Arora, Aashay; Davila, Diego; Graham, John; Hutton, Thomas; Lehman, Tom; et al (January 2024, EPJ Web of Conferences)
De Vita, R; Espinal, X; Laycock, P; Shadura, O (Eds.)
The Large Hadron Collider (LHC) experiments distribute data by leveraging a diverse array of National Research and Education Networks (NRENs), where experiment data management systems treat networks as a “blackbox” resource. After the High Luminosity upgrade, the Compact Muon Solenoid (CMS) experiment alone will produce roughly 0.5 exabytes of data per year. NRENs are a critical part of the success of CMS and other LHC experiments. However, during data movement, NRENs are unaware of data priorities, importance, or need for quality of service, and this poses a challenge for operators to coordinate the movement of data and have predictable data flows across multi-domain networks. The overarching goal of SENSE (The Software-defined network for End-to-end Networked Science at Exascale) is to enable National Labs and universities to request and provision end-to-end intelligent network services for their application workflows leveraging SDN (Software-Defined Networking) capabilities. This work aims to allow LHC experiments and Rucio, the data management software used by the CMS experiment, to allocate and prioritize certain data transfers over the wide area network. In this paper, we present the current progress in integrating SENSE, which provides multi-domain end-to-end SDN orchestration with QoS (Quality of Service) capabilities, with Rucio.
Full Text Available
The anachronism of whole-GPU accounting

https://doi.org/10.1145/3491418.3535125

Sfiligoi, Igor; Schultz, David; Würthwein, Frank; Riedel, Benedikt; Mishin, Dmitry (July 2022, Practice and Experience in Advanced Research Computing)

Full Text Available
Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches

https://doi.org/10.1145/3526064.3534111

Bellavita, Julian; Sim, Alex; Wu, Kesheng; Monga, Inder; Guok, Chin; Würthwein, Frank; Davila, Diego (June 2022, SNTA '22: Fifth International Workshop on Systems and Network Telemetry and Analytics)

Full Text Available
